Link to the github io page: here
Convolutional Neural Networks (CNNs) have recently become the main machine learning tool for image datasets. Given an image, the convolution layers are good at learning basic image features like lines, curves, and edges using kernels learnt through supervised learning. These kernels convolve over each image to produce lower-level (more abstract) representations called channels.
Here is a simple demo of a 3x3 kernel (in dark blue) convolving over a 5x5 image (in blue) to produce a 3x3 channel (in white).
"
Usually, we use multiple kernels in each convolution layer, resulting in multiple channels. So, with multiple layers, the channel count grows large, making our data very high dimensional.
In this paper (https://arxiv.org/pdf/2011.08785.pdf), I was first introduced to the idea of using random subsets of channels while still obtaining performance similar to using all the channels. In the paper, they try to localize all the anomaly regions in an image. Although the scope of our project is just to perform binary classification, I am still really curious whether I could use a data science approach to pick the best subsets, or at least reduce the overall data size.
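Picking a random channel subset, as in that paper, is essentially a one-liner. A minimal sketch (the subset size of 100 is an arbitrary choice for illustration):

```python
import random

random.seed(0)  # for reproducibility
# Hypothetical: keep 100 of the 256 layer-3 channels, chosen at random
subset = sorted(random.sample(range(256), 100))
print(len(subset))  # 100
```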
Before I explain our dataframe, this is what the raw images look like.
This is a normal transistor image, referred to as a "normal image" in this notebook.
These are transistor images with anomalies, referred to as "anomaly images" in this notebook.
We applied image augmentation techniques to generate more images. The main task here is to classify whether a given image is normal or an anomaly.
However, we are not using these raw images directly in our data science project. Instead, we pass each image through a CNN and use the resulting channels as our unit of data. To be more specific, we use a pretrained ResNet model and take the channel output of its third layer as our data.
Below is the ResNet architecture, and we can observe that layer 3 has an output of shape 14x14x256, where each channel is of size 14x14.
To reduce the channel dimension, we average each channel (global average pooling) to make the final shape 1x256. Hence, each image is represented by a vector of shape 1x256.
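A minimal sketch of this pooling step, assuming a hypothetical layer-3 feature map of shape 256x14x14 (random values stand in for real ResNet activations):

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical layer-3 output for one image: 256 channels of size 14x14
feature_map = rng.random((256, 14, 14))

# Global average pooling: collapse each 14x14 channel to its mean
embedding = feature_map.mean(axis=(1, 2))
print(embedding.shape)  # (256,)
```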
We will use a data science approach to figure out how to select channels while maintaining performance similar to using all the channels for the classification task. We can also look for which dimensionality reduction works best if the model to be used is a simple logistic regression.
import os
import random
from random import sample
import warnings

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import LedoitWolf
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.metrics import (ConfusionMatrixDisplay, f1_score,
                             precision_score, recall_score)

warnings.filterwarnings('ignore')
I have stored two numpy arrays, one each for the normal and anomaly images.
normal_data = np.load("normal.npy")
anomaly_data = np.load("anomaly.npy")
# Column names for the 256 channels
cols = ["c" + str(i) for i in range(256)]
# Create pandas dataframe
df = pd.DataFrame(np.concatenate((normal_data, anomaly_data), axis=0), columns = cols)
# Assign label = 0 for normal images and label = 1 for anomaly
df['label'] = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
df.head()
| | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | ... | c247 | c248 | c249 | c250 | c251 | c252 | c253 | c254 | c255 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.158984 | 0.056839 | 0.081127 | 0.046833 | 0.121052 | 0.127416 | 0.058033 | 0.105025 | 0.024803 | 0.038923 | ... | 0.064348 | 0.154216 | 0.101745 | 0.062601 | 0.105468 | 0.301576 | 0.097783 | 0.187822 | 0.100279 | 0 |
| 1 | 0.148324 | 0.037281 | 0.087455 | 0.035361 | 0.105669 | 0.196064 | 0.063659 | 0.178187 | 0.026070 | 0.009796 | ... | 0.072462 | 0.113702 | 0.164750 | 0.080955 | 0.066059 | 0.300927 | 0.104027 | 0.174209 | 0.047375 | 0 |
| 2 | 0.162493 | 0.054301 | 0.053285 | 0.055391 | 0.119270 | 0.128621 | 0.080290 | 0.107885 | 0.030830 | 0.031822 | ... | 0.060907 | 0.144773 | 0.124535 | 0.051645 | 0.102880 | 0.312054 | 0.080980 | 0.152039 | 0.118316 | 0 |
| 3 | 0.158661 | 0.073720 | 0.071030 | 0.039224 | 0.101606 | 0.083979 | 0.060536 | 0.126996 | 0.035288 | 0.026823 | ... | 0.045971 | 0.128997 | 0.078799 | 0.059057 | 0.091431 | 0.327879 | 0.086462 | 0.174518 | 0.107508 | 0 |
| 4 | 0.150962 | 0.048356 | 0.060710 | 0.064295 | 0.136142 | 0.139862 | 0.080558 | 0.127039 | 0.027589 | 0.024897 | ... | 0.059857 | 0.135180 | 0.113146 | 0.073884 | 0.080941 | 0.390073 | 0.075463 | 0.166586 | 0.076868 | 0 |
5 rows × 257 columns
df.describe()
| | c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 | ... | c247 | c248 | c249 | c250 | c251 | c252 | c253 | c254 | c255 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | ... | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 | 1805.000000 |
| mean | 0.157616 | 0.054395 | 0.085359 | 0.054675 | 0.115474 | 0.124441 | 0.098844 | 0.143318 | 0.034238 | 0.037625 | ... | 0.092088 | 0.096349 | 0.124493 | 0.083080 | 0.087279 | 0.364936 | 0.100797 | 0.188747 | 0.095412 | 0.243767 |
| std | 0.028883 | 0.016657 | 0.034048 | 0.027661 | 0.022775 | 0.019726 | 0.038861 | 0.038466 | 0.012033 | 0.028445 | ... | 0.043415 | 0.045875 | 0.027045 | 0.027615 | 0.013125 | 0.053219 | 0.023355 | 0.037013 | 0.019430 | 0.429473 |
| min | 0.060056 | 0.000407 | 0.023670 | 0.005960 | 0.016403 | 0.061239 | 0.039680 | 0.076324 | 0.000312 | 0.000411 | ... | 0.018627 | 0.001576 | 0.046374 | 0.014374 | 0.037666 | 0.216784 | 0.018390 | 0.105922 | 0.036836 | 0.000000 |
| 25% | 0.137777 | 0.043657 | 0.064352 | 0.038185 | 0.104442 | 0.112087 | 0.075920 | 0.113100 | 0.025964 | 0.021286 | ... | 0.060066 | 0.058490 | 0.104350 | 0.063856 | 0.078192 | 0.330996 | 0.085766 | 0.165368 | 0.082994 | 0.000000 |
| 50% | 0.161338 | 0.051488 | 0.073105 | 0.047939 | 0.118482 | 0.122699 | 0.090073 | 0.130767 | 0.033380 | 0.028622 | ... | 0.073684 | 0.108611 | 0.119653 | 0.076323 | 0.087256 | 0.367339 | 0.099524 | 0.179535 | 0.093917 | 0.000000 |
| 75% | 0.178718 | 0.062503 | 0.088412 | 0.060845 | 0.131367 | 0.134876 | 0.105640 | 0.170695 | 0.041085 | 0.038684 | ... | 0.123349 | 0.134120 | 0.144549 | 0.094636 | 0.096132 | 0.391594 | 0.115847 | 0.202120 | 0.106301 | 0.000000 |
| max | 0.254942 | 0.121560 | 0.203937 | 0.193614 | 0.174762 | 0.205209 | 0.323324 | 0.288091 | 0.091181 | 0.168471 | ... | 0.258239 | 0.211719 | 0.245009 | 0.190859 | 0.132892 | 0.745504 | 0.186764 | 0.321174 | 0.198278 | 1.000000 |
8 rows × 257 columns
df['label'].value_counts()
0    1365
1     440
Name: label, dtype: int64
So, there is a small class imbalance here, but it should not be a big issue.
We first check if the columns are correlated, since correlation allows us to use linear dimensionality reduction techniques like PCA, or we can even simply remove columns with high correlations.
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number, cmap = "coolwarm")
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);
We first try a neural network, since it usually performs better on high-dimensional data.
# Function to split the data, with or without scaling
def get_train_test_split(X = None, y = None, rand_int = 42, scale = False):
    if X is None and y is None:
        X = np.concatenate((normal_data, anomaly_data), axis=0)
        y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        stratify=y,
                                                        shuffle=True,
                                                        random_state=rand_int,
                                                        test_size=0.2)
    if scale:
        scaler = StandardScaler()
        scaler.fit(X_train)
        X_train_scaled = scaler.transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return X_train_scaled, X_test_scaled, y_train, y_test
    else:
        return X_train, X_test, y_train, y_test
Use all the columns
We train a model with all features and find average train and test score.
# List to store the scores
scores = []
# Train and test data
X_train, X_test, y_train, y_test = get_train_test_split()
for i in range(10):
    nn = MLPClassifier(hidden_layer_sizes=(20, 20), alpha=0.01, max_iter=1000)
    # Train the model
    nn.fit(X_train, y_train)
    # Train score
    train_score = nn.score(X_train, y_train)
    # Test score
    test_score = nn.score(X_test, y_test)
    scores.append([train_score, test_score])
# Find mean and standard deviation (from the diagonal of the covariance)
_cov = np.cov(scores, rowvar=False)
mean = np.mean(scores, axis=0)
stand_dev = np.sqrt(np.diag(_cov))
print("mean score is ", mean)
print("standard deviation is ", stand_dev)
mean score is [0.9367036 0.89944598] standard deviation is [0.01554774 0.01197169]
Use 80 random columns
Here, we randomly select 80 columns and check how the performance changes.
def random_NN(X_train, y_train, X_test, y_test, rand_columns_count, hidden_layer_sizes = (20, 20), loop_count = 10):
    scores = []
    for i in range(loop_count):
        col_count = X_train.shape[1]
        rand_cols = sorted(random.sample(range(col_count), rand_columns_count))
        nn = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, alpha=0.001, max_iter=1000)
        # Train the model
        nn.fit(X_train[:, rand_cols], y_train)
        # Train score
        train_score = nn.score(X_train[:, rand_cols], y_train)
        # Test score
        test_score = nn.score(X_test[:, rand_cols], y_test)
        scores.append([train_score, test_score])
    # Find mean and standard deviation (from the diagonal of the covariance)
    _cov = np.cov(scores, rowvar=False)
    mean = np.mean(scores, axis=0)
    stand_dev = np.sqrt(np.diag(_cov))
    return scores, mean, stand_dev
scores, mean, stand_dev = random_NN(X_train, y_train, X_test, y_test,
rand_columns_count = 80, hidden_layer_sizes = (20,20), loop_count = 10)
print("mean score is ", mean)
print("standard deviation is ", stand_dev)
scores
mean score is [0.90547091 0.87534626] standard deviation is [0.00864496 0.01905796]
[[0.9023545706371191, 0.8337950138504155], [0.9092797783933518, 0.8836565096952909], [0.9030470914127424, 0.8753462603878116], [0.8981994459833795, 0.8670360110803325], [0.9127423822714681, 0.8919667590027701], [0.8878116343490304, 0.8698060941828255], [0.9120498614958449, 0.8614958448753463], [0.917590027700831, 0.8781163434903048], [0.9016620498614959, 0.8919667590027701], [0.909972299168975, 0.9002770083102493]]
These are the train and test scores at each instance
What's the point of this comparison?
As we can see, there is not much difference in the score with just 80 columns. That means there may be a way to find the minimum number of columns needed for reasonable performance, ideally with the simplest classification model. Here, we used a neural net with 2 hidden layers since neural nets perform well on high-dimensional data. Do you think we can do better?
If two or more columns are highly correlated, we can keep one of them and remove the rest.
# This function drops columns with high correlation
def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model
        to generalize and improves the interpretability of the model.
    Inputs:
        x: features dataframe
        threshold: features with correlations greater than this value are removed
    Output:
        dataframe that contains only the non-highly-collinear features
    '''
    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i + 1):
            item = corr_matrix.iloc[j:(j + 1), (i + 1):(i + 2)]
            col = item.columns
            row = item.index
            val = abs(item.values)
            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])
    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)
    return x
# Remove features with collinearity above 0.7 to prevent overfitting
df_new = remove_collinear_features(df.iloc[:,0:-1], 0.7)
c3 | c2 | 0.72 c9 | c2 | 0.81 c9 | c3 | 0.8 c14 | c3 | 0.71 c14 | c9 | 0.77 c20 | c2 | 0.7 c20 | c3 | 0.72 c20 | c9 | 0.76 c24 | c19 | 0.75 c25 | c3 | 0.72 c25 | c9 | 0.77 c25 | c14 | 0.73 c25 | c20 | 0.76 c30 | c2 | 0.73 c30 | c9 | 0.73 c30 | c16 | 0.74 c31 | c7 | 0.72 c40 | c19 | 0.71 c41 | c25 | 0.71 c43 | c34 | 0.75 c57 | c6 | 0.74 c57 | c43 | 0.7 c58 | c19 | 0.77 c59 | c6 | 0.72 c59 | c18 | 0.73 c59 | c43 | 0.78 c59 | c54 | 0.74 c60 | c2 | 0.74 c60 | c21 | 0.75 c61 | c43 | 0.72 c61 | c59 | 0.72 c62 | c9 | 0.7 c63 | c2 | 0.71 c63 | c9 | 0.8 c67 | c63 | 0.74 c70 | c2 | 0.82 c70 | c3 | 0.8 c70 | c9 | 0.87 c70 | c30 | 0.77 c70 | c63 | 0.76 c70 | c64 | 0.73 c72 | c7 | 0.73 c72 | c43 | 0.76 c72 | c59 | 0.77 c72 | c70 | 0.71 c73 | c59 | 0.79 c77 | c9 | 0.74 c80 | c9 | 0.73 c80 | c16 | 0.72 c80 | c30 | 0.71 c80 | c70 | 0.73 c89 | c26 | 0.72 c92 | c67 | 0.75 c92 | c91 | 0.76 c95 | c2 | 0.79 c95 | c9 | 0.74 c95 | c30 | 0.74 c95 | c70 | 0.81 c104 | c2 | 0.84 c104 | c9 | 0.74 c104 | c28 | 0.71 c104 | c30 | 0.71 c104 | c70 | 0.75 c104 | c88 | 0.71 c104 | c92 | 0.73 c104 | c95 | 0.73 c107 | c58 | 0.71 c107 | c103 | 0.74 c109 | c2 | 0.78 c109 | c3 | 0.78 c109 | c9 | 0.82 c109 | c14 | 0.74 c109 | c20 | 0.71 c109 | c25 | 0.74 c109 | c60 | 0.85 c109 | c62 | 0.71 c109 | c70 | 0.75 c110 | c109 | 0.72 c111 | c7 | 0.72 c111 | c28 | 0.78 c111 | c104 | 0.75 c113 | c2 | 0.72 c113 | c9 | 0.73 c113 | c30 | 0.74 c113 | c63 | 0.71 c113 | c70 | 0.72 c121 | c43 | 0.72 c121 | c59 | 0.7 c121 | c94 | 0.7 c125 | c2 | 0.74 c125 | c9 | 0.74 c125 | c64 | 0.74 c125 | c70 | 0.78 c125 | c72 | 0.76 c125 | c95 | 0.76 c125 | c109 | 0.74 c129 | c58 | 0.72 c131 | c6 | 0.79 c131 | c43 | 0.73 c131 | c57 | 0.76 c131 | c59 | 0.81 c131 | c121 | 0.76 c133 | c6 | 0.75 c133 | c43 | 0.75 c133 | c59 | 0.79 c133 | c131 | 0.79 c134 | c91 | 0.74 c134 | c107 | 0.75 c140 | c30 | 0.73 c140 | c70 | 0.7 c143 | c3 | 0.71 c143 | c9 | 0.77 c143 | c14 | 0.74 c143 | c20 | 0.71 c143 | c25 | 0.78 c143 | c70 | 0.71 c143 | c109 | 
0.75 c144 | c2 | 0.84 c144 | c9 | 0.76 c144 | c70 | 0.74 c144 | c95 | 0.71 c144 | c104 | 0.8 c144 | c109 | 0.71 c147 | c7 | 0.83 c147 | c28 | 0.7 c147 | c43 | 0.74 c147 | c59 | 0.81 c147 | c61 | 0.71 c147 | c67 | 0.83 c147 | c70 | 0.77 c147 | c72 | 0.84 c147 | c80 | 0.7 c147 | c92 | 0.75 c147 | c95 | 0.72 c147 | c104 | 0.76 c147 | c111 | 0.74 c147 | c124 | 0.73 c147 | c125 | 0.79 c147 | c133 | 0.7 c154 | c0 | 0.73 c154 | c7 | 0.72 c154 | c28 | 0.86 c154 | c72 | 0.74 c154 | c104 | 0.73 c154 | c111 | 0.79 c154 | c147 | 0.81 c155 | c147 | 0.73 c158 | c7 | 0.81 c158 | c67 | 0.71 c158 | c91 | 0.73 c158 | c92 | 0.72 c158 | c104 | 0.7 c158 | c111 | 0.7 c158 | c125 | 0.71 c158 | c147 | 0.85 c158 | c154 | 0.75 c162 | c2 | 0.79 c162 | c9 | 0.76 c162 | c70 | 0.73 c162 | c104 | 0.78 c162 | c109 | 0.73 c162 | c143 | 0.73 c162 | c144 | 0.74 c163 | c21 | 0.71 c163 | c60 | 0.73 c163 | c109 | 0.72 c165 | c43 | 0.75 c165 | c59 | 0.78 c165 | c72 | 0.73 c165 | c73 | 0.71 c165 | c109 | 0.71 c165 | c121 | 0.81 c165 | c125 | 0.77 c165 | c131 | 0.72 c165 | c147 | 0.75 c167 | c7 | 0.76 c167 | c31 | 0.74 c167 | c147 | 0.76 c169 | c16 | 0.7 c169 | c30 | 0.79 c169 | c104 | 0.72 c172 | c7 | 0.71 c172 | c72 | 0.7 c172 | c147 | 0.76 c172 | c154 | 0.7 c172 | c158 | 0.71 c175 | c2 | 0.78 c175 | c3 | 0.74 c175 | c9 | 0.72 c175 | c20 | 0.73 c175 | c30 | 0.73 c175 | c104 | 0.71 c175 | c162 | 0.74 c176 | c58 | 0.72 c176 | c151 | 0.71 c178 | c2 | 0.82 c178 | c3 | 0.76 c178 | c9 | 0.85 c178 | c20 | 0.71 c178 | c25 | 0.73 c178 | c30 | 0.8 c178 | c63 | 0.81 c178 | c67 | 0.76 c178 | c70 | 0.85 c178 | c80 | 0.74 c178 | c92 | 0.7 c178 | c95 | 0.76 c178 | c104 | 0.8 c178 | c109 | 0.75 c178 | c113 | 0.73 c178 | c125 | 0.75 c178 | c140 | 0.72 c178 | c144 | 0.81 c178 | c147 | 0.75 c178 | c169 | 0.72 c178 | c175 | 0.71 c179 | c2 | 0.7 c179 | c64 | 0.76 c179 | c70 | 0.72 c179 | c72 | 0.78 c179 | c88 | 0.7 c179 | c104 | 0.7 c179 | c125 | 0.8 c179 | c147 | 0.82 c179 | c154 | 0.75 c179 | c155 | 0.7 c179 | c158 | 0.8 
c179 | c172 | 0.71 c180 | c7 | 0.71 c180 | c67 | 0.73 c180 | c147 | 0.79 c180 | c158 | 0.76 c180 | c179 | 0.72 c182 | c2 | 0.74 c182 | c9 | 0.81 c182 | c43 | 0.73 c182 | c59 | 0.72 c182 | c60 | 0.74 c182 | c61 | 0.71 c182 | c62 | 0.72 c182 | c63 | 0.74 c182 | c70 | 0.78 c182 | c72 | 0.71 c182 | c109 | 0.81 c182 | c110 | 0.71 c182 | c121 | 0.75 c182 | c125 | 0.8 c182 | c144 | 0.72 c182 | c147 | 0.78 c182 | c165 | 0.81 c182 | c178 | 0.79 c185 | c19 | 0.74 c185 | c58 | 0.7 c185 | c129 | 0.75 c185 | c181 | 0.71 c187 | c2 | 0.88 c187 | c3 | 0.71 c187 | c9 | 0.77 c187 | c64 | 0.75 c187 | c70 | 0.82 c187 | c95 | 0.75 c187 | c104 | 0.81 c187 | c109 | 0.73 c187 | c113 | 0.73 c187 | c125 | 0.77 c187 | c144 | 0.73 c187 | c147 | 0.73 c187 | c162 | 0.81 c187 | c164 | 0.76 c187 | c175 | 0.84 c187 | c178 | 0.76 c187 | c179 | 0.78 c187 | c182 | 0.72 c190 | c2 | 0.8 c190 | c9 | 0.84 c190 | c14 | 0.78 c190 | c43 | 0.77 c190 | c50 | 0.71 c190 | c59 | 0.76 c190 | c60 | 0.74 c190 | c61 | 0.77 c190 | c62 | 0.78 c190 | c63 | 0.73 c190 | c70 | 0.79 c190 | c72 | 0.72 c190 | c73 | 0.73 c190 | c104 | 0.73 c190 | c109 | 0.86 c190 | c121 | 0.71 c190 | c125 | 0.78 c190 | c133 | 0.7 c190 | c143 | 0.74 c190 | c144 | 0.74 c190 | c147 | 0.77 c190 | c162 | 0.74 c190 | c165 | 0.78 c190 | c178 | 0.75 c190 | c182 | 0.83 c190 | c187 | 0.8 c191 | c9 | 0.72 c191 | c190 | 0.72 c192 | c23 | 0.72 c192 | c43 | 0.77 c198 | c188 | 0.72 c201 | c9 | 0.71 c201 | c43 | 0.79 c201 | c59 | 0.75 c201 | c61 | 0.72 c201 | c70 | 0.76 c201 | c72 | 0.77 c201 | c125 | 0.72 c201 | c147 | 0.78 c201 | c165 | 0.7 c201 | c182 | 0.75 c201 | c190 | 0.77 c202 | c28 | 0.73 c202 | c59 | 0.71 c202 | c72 | 0.74 c202 | c147 | 0.79 c202 | c154 | 0.8 c202 | c158 | 0.71 c202 | c179 | 0.7 c205 | c6 | 0.76 c205 | c14 | 0.74 c205 | c43 | 0.82 c205 | c54 | 0.72 c205 | c57 | 0.71 c205 | c59 | 0.87 c205 | c61 | 0.75 c205 | c67 | 0.72 c205 | c72 | 0.72 c205 | c73 | 0.76 c205 | c121 | 0.77 c205 | c131 | 0.84 c205 | c133 | 0.83 c205 | c147 | 0.8 
c205 | c165 | 0.79 c205 | c182 | 0.74 c205 | c190 | 0.81 c205 | c192 | 0.7 c205 | c201 | 0.77 c207 | c9 | 0.72 c208 | c2 | 0.75 c208 | c7 | 0.73 c208 | c9 | 0.74 c208 | c43 | 0.76 c208 | c59 | 0.76 c208 | c61 | 0.72 c208 | c67 | 0.75 c208 | c70 | 0.81 c208 | c72 | 0.8 c208 | c104 | 0.77 c208 | c125 | 0.78 c208 | c147 | 0.91 c208 | c154 | 0.77 c208 | c158 | 0.79 c208 | c165 | 0.79 c208 | c172 | 0.72 c208 | c178 | 0.77 c208 | c179 | 0.78 c208 | c180 | 0.71 c208 | c182 | 0.81 c208 | c187 | 0.79 c208 | c190 | 0.82 c208 | c201 | 0.78 c208 | c202 | 0.79 c208 | c205 | 0.8 c209 | c9 | 0.71 c209 | c59 | 0.77 c209 | c70 | 0.76 c209 | c73 | 0.79 c209 | c109 | 0.72 c209 | c121 | 0.71 c209 | c125 | 0.74 c209 | c147 | 0.74 c209 | c165 | 0.78 c209 | c182 | 0.75 c209 | c190 | 0.8 c209 | c205 | 0.79 c209 | c208 | 0.76 c213 | c13 | 0.75 c213 | c64 | 0.72 c218 | c176 | 0.77 c218 | c185 | 0.71 c226 | c109 | 0.74 c226 | c190 | 0.74 c226 | c205 | 0.71 c228 | c19 | 0.73 c228 | c58 | 0.72 c228 | c218 | 0.73 c229 | c80 | 0.71 c230 | c6 | 0.75 c230 | c7 | 0.78 c230 | c9 | 0.71 c230 | c43 | 0.85 c230 | c54 | 0.72 c230 | c59 | 0.89 c230 | c61 | 0.75 c230 | c62 | 0.71 c230 | c63 | 0.72 c230 | c67 | 0.79 c230 | c70 | 0.76 c230 | c72 | 0.84 c230 | c73 | 0.76 c230 | c121 | 0.74 c230 | c125 | 0.77 c230 | c131 | 0.81 c230 | c133 | 0.82 c230 | c147 | 0.9 c230 | c154 | 0.73 c230 | c158 | 0.77 c230 | c165 | 0.84 c230 | c172 | 0.76 c230 | c178 | 0.76 c230 | c179 | 0.78 c230 | c180 | 0.77 c230 | c182 | 0.8 c230 | c187 | 0.74 c230 | c190 | 0.82 c230 | c192 | 0.72 c230 | c201 | 0.83 c230 | c202 | 0.77 c230 | c205 | 0.89 c230 | c208 | 0.89 c230 | c209 | 0.78 c236 | c88 | 0.75 c236 | c103 | 0.73 c236 | c106 | 0.72 c236 | c107 | 0.71 c236 | c147 | 0.74 c236 | c158 | 0.78 c236 | c179 | 0.77 c240 | c2 | 0.78 c240 | c9 | 0.79 c240 | c60 | 0.83 c240 | c62 | 0.75 c240 | c63 | 0.73 c240 | c70 | 0.73 c240 | c109 | 0.87 c240 | c125 | 0.72 c240 | c144 | 0.73 c240 | c163 | 0.71 c240 | c165 | 0.7 c240 | c178 | 0.73 
c240 | c182 | 0.81 c240 | c190 | 0.83 c240 | c208 | 0.72 c240 | c209 | 0.71 c240 | c226 | 0.74 c242 | c2 | 0.8 c242 | c64 | 0.82 c242 | c70 | 0.74 c242 | c104 | 0.78 c242 | c162 | 0.79 c242 | c175 | 0.8 c242 | c179 | 0.71 c242 | c187 | 0.87 c244 | c7 | 0.75 c244 | c91 | 0.73 c244 | c103 | 0.71 c244 | c107 | 0.7 c244 | c111 | 0.71 c244 | c147 | 0.78 c244 | c158 | 0.79 c244 | c179 | 0.75 c244 | c236 | 0.81 c245 | c2 | 0.76 c245 | c7 | 0.72 c245 | c9 | 0.8 c245 | c43 | 0.75 c245 | c59 | 0.79 c245 | c62 | 0.73 c245 | c63 | 0.74 c245 | c64 | 0.73 c245 | c67 | 0.79 c245 | c70 | 0.83 c245 | c72 | 0.78 c245 | c73 | 0.71 c245 | c80 | 0.72 c245 | c92 | 0.71 c245 | c95 | 0.73 c245 | c104 | 0.75 c245 | c109 | 0.72 c245 | c110 | 0.72 c245 | c113 | 0.76 c245 | c125 | 0.81 c245 | c133 | 0.72 c245 | c144 | 0.74 c245 | c147 | 0.86 c245 | c155 | 0.7 c245 | c158 | 0.73 c245 | c165 | 0.8 c245 | c172 | 0.72 c245 | c178 | 0.82 c245 | c179 | 0.81 c245 | c180 | 0.7 c245 | c182 | 0.81 c245 | c187 | 0.85 c245 | c190 | 0.83 c245 | c201 | 0.79 c245 | c202 | 0.71 c245 | c205 | 0.8 c245 | c207 | 0.73 c245 | c208 | 0.86 c245 | c209 | 0.76 c245 | c230 | 0.91 c245 | c240 | 0.7 c246 | c92 | 0.73 c246 | c111 | 0.72 c246 | c147 | 0.77 c246 | c158 | 0.72 c247 | c7 | 0.82 c247 | c67 | 0.76 c247 | c72 | 0.73 c247 | c147 | 0.86 c247 | c154 | 0.75 c247 | c158 | 0.84 c247 | c167 | 0.72 c247 | c172 | 0.7 c247 | c179 | 0.74 c247 | c180 | 0.74 c247 | c202 | 0.7 c247 | c208 | 0.79 c247 | c230 | 0.84 c247 | c245 | 0.72 c248 | c2 | 0.7 c248 | c7 | 0.77 c248 | c28 | 0.78 c248 | c72 | 0.72 c248 | c88 | 0.75 c248 | c91 | 0.7 c248 | c92 | 0.75 c248 | c104 | 0.78 c248 | c111 | 0.8 c248 | c147 | 0.82 c248 | c154 | 0.78 c248 | c158 | 0.8 c248 | c179 | 0.76 c248 | c187 | 0.74 c248 | c208 | 0.75 c248 | c236 | 0.74 c248 | c242 | 0.71 c248 | c244 | 0.77 c248 | c245 | 0.71 c248 | c246 | 0.78 c248 | c247 | 0.73 c249 | c111 | 0.74 c250 | c147 | 0.76 c250 | c190 | 0.73 c250 | c205 | 0.74 c250 | c208 | 0.73 c250 | c209 | 0.73 
c250 | c230 | 0.74 c254 | c208 | 0.74 c254 | c209 | 0.72 c254 | c230 | 0.79 c254 | c245 | 0.75
df_new.shape
(1805, 169)
So, we went from 256 features to 169 features.
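As an aside, the same drop rule can be expressed without the nested loop by masking the strict upper triangle of the correlation matrix. A toy sketch with made-up columns (`c1` is a near copy of `c0`, `c2` is independent):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy frame: c1 is a near copy of c0, c2 is independent noise
df_toy = pd.DataFrame({"c0": rng.normal(size=200)})
df_toy["c1"] = df_toy["c0"] * 1.01
df_toy["c2"] = rng.normal(size=200)

corr = df_toy.corr().abs()
# Mask to the strict upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
drop = [col for col in upper.columns if (upper[col] >= 0.7).any()]
print(drop)  # ['c1']
```

This is a vectorized variant of the loop above: for each correlated pair, the later column is flagged and dropped.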
X, y = df_new.to_numpy(), df.iloc[:,-1].to_numpy()
X_train, X_test, y_train, y_test = get_train_test_split(X = X, y = y, scale = True)
Cross-validation search space
param_grid={'hidden_layer_sizes':[(10,10,10),(15,15,15),(20,20,20),(15,15), (20,20)],
"alpha": [0.00001,0.0001,0.001,0.01]}
# Create the model to be tuned
n_iter = 30
folds = 8
nn_base = MLPClassifier(max_iter=1000)
# Create the randomized search for the MLP
nn_random = RandomizedSearchCV(estimator = nn_base, param_distributions = param_grid,
n_iter = n_iter, cv = folds, verbose = 2, random_state = 42,
n_jobs = -1)
# Create the grid search
nn_grid = GridSearchCV(estimator = nn_base, param_grid = param_grid,
cv = folds, verbose = 2, n_jobs = -1)
# Shows the confusion matrix
def confusion(model, x, y):
    pred = model.predict(x)
    fig, ax = plt.subplots(figsize=(10, 5))
    ConfusionMatrixDisplay.from_predictions(y, pred, ax=ax)
    ax.xaxis.set_ticklabels(["normal", "anomaly"])
    ax.yaxis.set_ticklabels(["normal", "anomaly"])
    _ = ax.set_title(
        f"Confusion Matrix for {model.__class__.__name__}"
    )
# Fit the random search model
nn_random.fit(X_train, y_train)
print("Best train score ", nn_random.best_score_)
print("Test score ", nn_random.score(X_test, y_test))
# View the best parameters from the random search
print(nn_random.best_params_)
nn_model = nn_random.best_estimator_
nn_model.fit(X_train, y_train)
print("test score after refitting", nn_model.score(X_test, y_test))
Fitting 8 folds for each of 20 candidates, totalling 160 fits
Best train score 0.9411487108655617
Test score 0.9362880886426593
{'hidden_layer_sizes': (20, 20), 'alpha': 1e-05}
test score after refitting 0.9362880886426593
confusion(nn_model,X_test,y_test)
# Fit the grid search model
nn_grid.fit(X_train, y_train)
print("Best train score ", nn_grid.best_score_)
print("Test score ", nn_grid.score(X_test, y_test))
# View the best parameters from the grid search
print(nn_grid.best_params_)
nn_model = nn_grid.best_estimator_
nn_model.fit(X_train, y_train)
print("test score after refitting", nn_model.score(X_test, y_test))
Fitting 8 folds for each of 20 candidates, totalling 160 fits
Best train score 0.9425145794966237
Test score 0.9362880886426593
{'alpha': 0.0001, 'hidden_layer_sizes': (20, 20)}
test score after refitting 0.9307479224376731
confusion(nn_model,X_test,y_test)
Instead of using correlation, what if we use 169 random columns and compare which does better?
scores, mean, stand_dev = random_NN(X_train, y_train, X_test, y_test,
rand_columns_count = 169, hidden_layer_sizes = (20,20), loop_count = 10)
print("mean score is ", mean)
print("standard deviation is ", stand_dev)
scores
mean score is [1. 0.93379501] standard deviation is [0. 0.00936654]
[[1.0, 0.9418282548476454], [1.0, 0.9307479224376731], [1.0, 0.9390581717451524], [1.0, 0.9335180055401662], [1.0, 0.9113573407202216], [1.0, 0.9445983379501385], [1.0, 0.9362880886426593], [1.0, 0.9390581717451524], [1.0, 0.9279778393351801], [1.0, 0.9335180055401662]]
Correlation gives a test score of 93%, while random selection also gives an average test score of 93%. Random selection even performs better on the training data! So, we need to look for better approaches. Also, we need to reduce to fewer than 169 features. Let's see if we can do better!
PCA is a dimensionality reduction technique. It reduces the number of variables while trying to preserve as much information as possible. Using eigenvectors and eigenvalues, it determines the principal components, which are uncorrelated and compress as much information as possible into the first components.
Earlier in the correlation heatmap, we found that the columns are correlated to each other. Hence, we can use PCA to transform the data.
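We can verify the "uncorrelated components" claim on a toy dataset with two strongly correlated made-up features; after the PCA transform, the components have essentially zero correlation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two strongly correlated toy features
x = rng.normal(size=500)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=500)])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)

# The principal components are (numerically) uncorrelated
corr = np.corrcoef(Z, rowvar=False)
print(round(abs(corr[0, 1]), 6))  # 0.0
```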
# Train test split with standard scaling
X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(scale = True)
# PCA with 95% explained variance ratio
pca = PCA(0.95)
pca.fit(X_train_scaled)
PCA(n_components=0.95)
# Cumulative explained variance ratio covered by n components
cum_var = np.cumsum(pca.explained_variance_ratio_ * 100)
plt.plot(cum_var)
plt.axhline(y = 95, color = 'r', linestyle = 'dashed')
plt.axvline(x = 92, color = 'g', linestyle = 'dashed')
plt.xlabel('Number of components')
plt.ylabel('Explained Variance')
plt.yticks(sorted(list(range(0,110,10))+[95]));
plt.grid()
plt.show()
print(f"we need {len(pca.components_)} components")
we need 92 components
Hence, the dimension is reduced from 256 to 92!
# Transform the data
x_pca = pca.transform(X_train_scaled)
x_pca.shape
(1444, 92)
nn = MLPClassifier(hidden_layer_sizes= (20,20), alpha = 0.001,max_iter=1000)
nn.fit(x_pca, y_train)
x_test_pca = pca.transform(X_test_scaled)
print("training score is ", nn.score(x_pca,y_train))
print("test score is ",nn.score(x_test_pca,y_test) )
training score is  1.0
test score is  0.9362880886426593
# Confusion matrix
confusion(nn,x_test_pca,y_test)
We will experiment with different numbers of components.
n_list = np.linspace(2,X_train.shape[1], 20 ,dtype=int)
n_list
array([ 2, 10, 19, 28, 37, 45, 54, 63, 72, 81, 89, 98, 107,
116, 125, 133, 142, 151, 160, 169])
We check with one hidden layer here
# Stores all the scores
pca_track = []
for n in n_list:
    pca = PCA(n)
    # Fit the train data
    pca.fit(X_train_scaled)
    # Transform the train data
    x_pca = pca.transform(X_train_scaled)
    # Simple NN model
    nn = MLPClassifier(hidden_layer_sizes=(20,), alpha=0.01, max_iter=1000)
    # Fit the model
    nn.fit(x_pca, y_train)
    # Train score
    train_score = nn.score(x_pca, y_train)
    # Test score
    x_test_pca = pca.transform(X_test_scaled)
    test_score = nn.score(x_test_pca, y_test)
    # Append
    pca_track.append([n, train_score, test_score])
plt.figure(figsize=(12, 8))
plt.plot(n_list, np.array(pca_track)[:, 1], marker="o", linestyle="--", label="train", linewidth=0.5, color="blue")
plt.plot(n_list, np.array(pca_track)[:, 2], marker="o", linestyle="-", label="test", linewidth=1, color="red")
plt.xticks(n_list)
plt.legend()
plt.axhline(y=0.90, color='r', linestyle='dashed')
plt.xlabel("n components")
plt.ylabel("Classification accuracy")
plt.title("Train vs test accuracy with different PCA n-components using 1 hidden layer")
plt.show()
Compared to the correlation approach for dimension reduction (169 columns), PCA does much better, reaching above 90% test accuracy with as few as 40 components.
We check with no hidden layer here
pca_track = []
for n in n_list:
    pca = PCA(n)
    # Fit the train data
    pca.fit(X_train_scaled)
    # Transform the train data
    x_pca = pca.transform(X_train_scaled)
    # NN model with no hidden layer
    nn = MLPClassifier(hidden_layer_sizes=(), alpha=0.01, max_iter=1000)
    # Fit the model
    nn.fit(x_pca, y_train)
    # Train score
    train_score = nn.score(x_pca, y_train)
    # Test score
    x_test_pca = pca.transform(X_test_scaled)
    test_score = nn.score(x_test_pca, y_test)
    # Append
    pca_track.append([n, train_score, test_score])
plt.figure(figsize=(12, 8))
plt.plot(n_list, np.array(pca_track)[:, 1], marker="o", linestyle="--", label="train", linewidth=0.5, color="blue")
plt.plot(n_list, np.array(pca_track)[:, 2], marker="o", linestyle="-", label="test", linewidth=1, color="red")
plt.xticks(n_list)
plt.legend()
plt.axhline(y=0.90, color='r', linestyle='dashed')
plt.xlabel("n components")
plt.ylabel("Classification accuracy")
plt.title("Train vs test accuracy with different PCA n-components using no hidden layer")
plt.show()
Even without any hidden layers (essentially a logistic regression), we obtain good performance with as few as 47 components.
What about random columns??
random_track = []
for n in n_list:
    _, mean, stand_dev = random_NN(X_train_scaled, y_train, X_test_scaled, y_test,
                                   rand_columns_count=n, hidden_layer_sizes=(20,), loop_count=10)
    # Append
    random_track.append([n, mean[0], mean[1]])
plt.figure(figsize=(12, 8))
plt.plot(n_list, np.array(random_track)[:, 1], marker="o", linestyle="--", label="train", linewidth=0.5, color="blue")
plt.plot(n_list, np.array(random_track)[:, 2], marker="o", linestyle="-", label="test", linewidth=1, color="red")
plt.xticks(n_list)
plt.legend()
plt.axhline(y=0.90, color='r', linestyle='dashed')
plt.xlabel("n components")
plt.ylabel("Classification accuracy")
plt.title("Train vs test accuracy with different random selection using one hidden layer")
plt.show()
What's the observation?
Clearly, once we start to reduce the dimension, PCA outperforms random selection. Now the question is whether there is any way we can outperform PCA itself.
Maybe we can ask why PCA does not work well with 10 components, or even with 2.
pca_2 = PCA(2)
pca_2.fit(X_train_scaled)
x_pca_2 = pca_2.transform(X_train_scaled)
plt.figure()
sns.scatterplot(x = x_pca_2[:,0], y = x_pca_2[:,1], hue = y_train)
We can see that the normal images are clustered into 3 groups. However, the anomaly images form no such clusters and are scattered around. Hence, a model cannot predict the labels using just 2 (or very few) components.
To solve this issue, we introduce the idea of gaussian/normal distribution. Using normal distribution, we can make comapre how far from the normal distibution each data point is.
Here, we use the mean and covariance to make a simple classifier.
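The distance used below is the Mahalanobis distance, sqrt((x - mu)^T Sigma^{-1} (x - mu)). A minimal check (toy data, not the notebook's) that `scipy.spatial.distance.mahalanobis`, which takes the *inverse* covariance, matches the formula:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))                     # toy stand-in for the normal data
mu = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))  # inverse covariance, as below

x = np.array([2.0, -1.0, 0.5])
d_scipy = mahalanobis(x, mu, cov_inv)
d_manual = np.sqrt((x - mu) @ cov_inv @ (x - mu))
print(np.isclose(d_scipy, d_manual))                 # True
```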
Use the normal data to compute the mean and covariance
# covariance
_cov = np.cov(normal_data, rowvar=False)
# Mean
mean = np.mean(normal_data, axis=0)
# Inverse of covariance
cov_inv = np.linalg.inv(_cov)
Distance of the normal and anomaly data from the Gaussian distribution, using the Mahalanobis distance
dist_normal = [mahalanobis(_img, mean, cov_inv) for _img in normal_data]
dist_anomaly = [mahalanobis(_img, mean, cov_inv) for _img in anomaly_data]
We plot how the distances for the normal and anomaly data are distributed. Normal images should have relatively small distances, whereas anomalies should have larger ones.
plt.hist(dist_normal,
alpha=0.5, # the transparency parameter
label='Normal data')
plt.hist(dist_anomaly,
alpha=0.5,
label='Anomaly data')
plt.legend(loc='upper right')
plt.title('Histogram of normal and anomaly data')
plt.xlabel("Mahalanobis distance from Gaussian distribution")
plt.ylabel("Frequency")
plt.show()
We can see that there is an overlapping region. Hence, a simple classifier will not be able to separate the classes using a distance threshold alone.
Find the lower and upper bounds that mark outliers
q1, q3= np.percentile(dist_normal,[25,75])
iqr = q3 - q1
lower_bound = q1 -(1.5 * iqr)
upper_bound = q3 +(1.5 * iqr)
print("lower bound is ", lower_bound)
print("upper bound is ", upper_bound)
lower bound is 9.04635286010004 upper bound is 22.595015689220833
def is_normal(distance, lower_bound = lower_bound, upper_bound = upper_bound):
    return lower_bound < distance < upper_bound
For normal data
correct = 0
incorrect = 0
for _dist in dist_normal:
is_norm = is_normal(_dist)
if is_norm:
correct+=1
else:
incorrect+=1
normal_classification_accuracy = round(correct/(correct + incorrect), 4)
For anomaly data
correct = 0
incorrect = 0
for _dist in dist_anomaly:
is_norm = is_normal(_dist)
if is_norm:
incorrect+=1
else:
correct+=1
anomaly_classification_accuracy = round(correct/(correct + incorrect), 4)
bar_labels = ["normal data", "anomaly data"]
bar_vals = [normal_classification_accuracy, anomaly_classification_accuracy]
plt.bar(bar_labels,bar_vals)
for i,j in zip(bar_labels, bar_vals):
plt.text(i,j, str(j))
plt.xlabel("data class")
plt.ylabel("Classification accuracy")
plt.title("Comparison of classification accuracy using the Mahalanobis distance")
So, this simple model is not a good one.
One possible issue could be the high dimensionality, so we first reduce the data with PCA:
pca = PCA(0.95)
pca.fit(normal_data)
components_count = len(pca.components_)
components_count
52
_normal_pca = pca.transform(normal_data)
_anomaly_pca = pca.transform(anomaly_data)
_cov = np.cov(_normal_pca, rowvar=False)
mean = np.mean(_normal_pca, axis=0)
cov_inv = np.linalg.inv(_cov)
dist_normal = [mahalanobis(_img, mean, cov_inv) for _img in _normal_pca]
dist_anomaly = [mahalanobis(_img, mean, cov_inv) for _img in _anomaly_pca]
plt.hist(dist_normal,
alpha=0.5, # the transparency parameter
label='Distance distribution of normal data')
plt.hist(dist_anomaly,
alpha=0.5,
label='Distance distribution of anomaly data')
plt.legend(loc='upper right')
plt.title('Overlapping histogram of normal and anomaly data with both alpha=0.5')
plt.xlabel("Mahalanobis distance from Gaussian distribution")
plt.ylabel("Frequency")
plt.show()
Just from the histogram, we can see that the distributions overlap even more after applying PCA.
Maybe we cannot classify the rows based on a single distance alone, so we need a multi-dimensional approach for better classification.
We can treat the column-wise mean of the normal data as a mean representation of a normal image, and compute the standard deviation of each column from the normal data.
We then convert each row into z-score form using this mean and standard deviation.
from scipy import stats
np.set_printoptions(suppress=False)
# Find mean and covariance and standard deviation
_cov = np.cov(normal_data, rowvar=False)
mean = np.mean(normal_data, axis=0)
# Standard deviation from the diagonal of the covariance (the per-column variances)
stand_dev = np.sqrt(np.diag(_cov))
Converts the array into z-score form
def z_transform(arr,mean,sd):
# Difference with mean (x - x_bar)
diff = (arr - mean)
# Divide by standard deviation (x-x_bar)/sd
z_score = diff/sd
return z_score
# Transform the normal and anomaly data
normal_z_z = z_transform(normal_data, mean, stand_dev)
anomaly_z_z= z_transform(anomaly_data, mean, stand_dev)
Here, we check the histograms of a few columns after the z-transform.
fig, ax = plt.subplots(10, 2, figsize=(8, 40))
indices = random.sample(range(256), 10)
for i, t in enumerate(indices):
    ax[i][0].hist(normal_z_z[:, t])
    ax[i][1].hist(anomaly_z_z[:, t])
    ax[i][0].set_title(f"col_{t} normal")
    ax[i][1].set_title(f"col_{t} anomaly")
A few assumptions: each column is roughly Gaussian, so z-scores computed from the normal data's statistics are comparable across columns.
Preparing the dataset
X = np.concatenate((normal_z_z, anomaly_z_z), axis=0)
y = [0] * normal_z_z.shape[0] + [1] * anomaly_z_z.shape[0]
X_train, X_test, y_train, y_test = get_train_test_split(X =X, y = y)
exp_variance = 0.9
pca = PCA(exp_variance)
pca.fit(X_train)
print(f"n components is {len(pca.components_)}")
n components is 53
cum_var = np.cumsum(pca.explained_variance_ratio_ * 100)
plt.plot(cum_var)
plt.axhline(y = exp_variance*100, color = 'r', linestyle = 'dashed')
plt.axvline(x = len(pca.components_), color = 'g', linestyle = 'dashed')
plt.xlabel('Number of components')
plt.ylabel('Explained variance (%)')
plt.yticks(range(0, 110, 10));
plt.grid()
plt.show()
train_pca = pca.transform(X_train)
test_pca = pca.transform(X_test)
nn = MLPClassifier(hidden_layer_sizes= (20,20), alpha = 0.01,max_iter=1000)
nn.fit(train_pca, y_train)
print("Training score is", nn.score(train_pca, y_train))
print("Test score is", nn.score(test_pca,y_test))
Training score is 1.0 Test score is 0.9141274238227147
Another approach is selecting the top-k columns contributing to the principal components. This way we get actual columns rather than PCA-transformed features.
# Returns the indices of the top k entries of a, largest first
def top_k(a,k):
return sorted(range(len(a)), key=lambda i: a[i], reverse= True)[:k]
# We take the single most important column of each principal component
most_important = [top_k(np.abs(pca.components_[i]),1) for i in range(len(pca.components_))]
most_important = sum(most_important,[])
most_important = list(set(most_important))
most_important = sorted(most_important)
len(most_important)
43
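A quick standalone illustration of the `top_k` helper above: it returns the indices of the k largest entries, largest value first.

```python
def top_k(a, k):
    # Sort the indices of a by their values, descending, and keep the first k
    return sorted(range(len(a)), key=lambda i: a[i], reverse=True)[:k]

print(top_k([0.1, 0.9, 0.4], 2))  # [1, 2]
```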
Use the columns obtained above
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,most_important], y = y)
nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 1.0 Test score is 0.8614958448753463
rand_cols = random.sample(range(256), len(most_important))
rand_cols = sorted(rand_cols)
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,rand_cols], y = y)
nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=2000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 1.0 Test score is 0.8698060941828255
Here, we repeat the experiment using the same random columns, but taken from the original dataset (without the custom z-transformation).
X = np.concatenate((normal_data, anomaly_data), axis=0)
y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
X_train, X_test, y_train, y_test = get_train_test_split(X = X[:,rand_cols], y = y)
nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train, y_train)
print("Training score is", nn.score(X_train, y_train))
print("Test score is", nn.score(X_test,y_test))
Training score is 0.8864265927977839 Test score is 0.8919667590027701
With random columns from the transformed data, the model fits the training set perfectly (train accuracy 1.0 vs 0.886), while the test scores are comparable. To check whether the custom transformation itself is responsible, we can also apply plain standard scaling and compare.
# Applying scaling before using PCA
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
nn = MLPClassifier(hidden_layer_sizes= (20, 20), alpha = 0.01,max_iter=1000)
nn.fit(X_train_scaled, y_train)
print("Training score is", nn.score(X_train_scaled, y_train))
print("Test score is", nn.score(X_test_scaled,y_test))
Training score is 1.0 Test score is 0.8836565096952909
Even with plain standard scaling, we get essentially the same performance as with the z-score transformation.
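This is expected: `StandardScaler` is itself a per-column z-score using the training mean and standard deviation, so the two transforms differ only in which statistics they use (the custom transform uses statistics from the normal images only, with a ddof=1 covariance diagonal). A minimal check on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 4))

scaled = StandardScaler().fit_transform(X)
manual_z = (X - X.mean(axis=0)) / X.std(axis=0)  # per-column z-score by hand (ddof=0)
print(np.allclose(scaled, manual_z))             # True
```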
To get the big picture, we compare different neural network sizes across different column selections and transformations.
Compare random picks from the regular, standard-scaled, z-transformed, and PCA-z-transformed data
Data without any transformation
X_train, X_test, y_train, y_test = get_train_test_split(rand_int = 100)
Standard Scaling on regular data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
Z-score transformation
X = np.concatenate((normal_z_z, anomaly_z_z), axis=0)
y = [0] * normal_z_z.shape[0] + [1] * anomaly_z_z.shape[0]
X_train_Z, X_test_Z, y_train_Z, y_test_Z = get_train_test_split(X= X, y = y, rand_int = 100)
Scaling the Z-score data
scaler_Z = StandardScaler()
scaler_Z.fit(X_train_Z)
X_train_scaled_Z = scaler_Z.transform(X_train_Z)
X_test_scaled_Z = scaler_Z.transform(X_test_Z)
Warning: this cell has a long runtime
# Stores everything
results_dict = {}
hidden_layer_sizes = [(20, 20), (20,), (5,), (1,), ()]
variance_percent = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4]
for hidden_layer_size in tqdm(hidden_layer_sizes):
pca_track = []
pca_normal = []
random_track_Z = []
random_track_Z_scaled = []
random_track = []
random_scaled = []
for _percent in tqdm(variance_percent):
# Create pca transform
pca = PCA(_percent)
# Fit the train data
pca.fit(X_train_Z)
components_count = len(pca.components_)
# Transform the train data
x_pca = pca.transform(X_train_Z)
# Simple NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Fit the model
nn.fit(x_pca, y_train_Z)
# Train score
train_score = nn.score(x_pca, y_train_Z)
# Test score
x_test_pca = pca.transform(X_test_Z)
test_score = nn.score(x_test_pca,y_test_Z)
# Append
pca_track.append([components_count,train_score, test_score])
# PCA with normal data
pca = PCA(components_count)
# Fit the train data
pca.fit(X_train_scaled)
# Transform the train data
x_pca = pca.transform(X_train_scaled)
# Simple NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Fit the model
nn.fit(x_pca, y_train)
# Train score
train_score = nn.score(x_pca, y_train)
# Test score
x_test_pca = pca.transform(X_test_scaled)
test_score = nn.score(x_test_pca,y_test)
# Append
pca_normal.append([components_count, train_score, test_score])
random.seed(2)
# Random cols
rand_cols = random.sample(range(256), components_count)
rand_cols = sorted(rand_cols)
# Random pick from z-score data
# NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Train the model
nn.fit(X_train_Z[:,rand_cols], y_train_Z)
# Train score
train_score = nn.score(X_train_Z[:,rand_cols], y_train_Z)
# test score
test_score = nn.score(X_test_Z[:,rand_cols],y_test_Z)
# Append
random_track_Z.append([components_count, train_score, test_score])
# Random pick from scaled z-score data
# NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Train the model
nn.fit(X_train_scaled_Z[:,rand_cols], y_train_Z)
# Train score
train_score = nn.score(X_train_scaled_Z[:,rand_cols], y_train_Z)
# test score
test_score = nn.score(X_test_scaled_Z[:,rand_cols],y_test_Z)
# Append
random_track_Z_scaled.append([components_count, train_score, test_score])
# Random pick from regular data
# NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Train the model
nn.fit(X_train[:,rand_cols], y_train)
# Train score
train_score = nn.score(X_train[:,rand_cols], y_train)
# test score
test_score = nn.score(X_test[:,rand_cols],y_test)
# Append
random_track.append([components_count, train_score, test_score])
# Random pick from standard scaled regular data
# NN model
nn = MLPClassifier(hidden_layer_sizes= hidden_layer_size, alpha = 0.01,max_iter=1000)
# Train on scaled data
nn.fit(X_train_scaled[:,rand_cols], y_train)
train_score = nn.score(X_train_scaled[:,rand_cols], y_train)
test_score = nn.score(X_test_scaled[:,rand_cols],y_test)
random_scaled.append([components_count, train_score, test_score])
results_dict[hidden_layer_size] = [pca_track, pca_normal, random_track_Z, random_track_Z_scaled, random_track, random_scaled ]
size0 = "0.5"
size1 = "1"
size2 = "2"
for key in results_dict.keys():
pca_track,pca_normal, random_track_Z, random_track_Z_scaled, random_track, random_scaled = results_dict[key]
n = np.array(random_track)[:,0]
plt.figure(figsize = (12,10))
# PCA
plt.plot(n, np.array(pca_track)[:,1], marker = ".", linestyle ="--", label = "pca on Z train", linewidth = size0, color = "red")
plt.plot(n, np.array(pca_track)[:,2], marker = ".", linestyle ="-", label = "pca on Z test", linewidth = size2, color = "red")
# Column selection using PCA
plt.plot(n, np.array(pca_normal)[:,1], marker = ".", linestyle ="--", label = "pca on original data train", linewidth = size0, color = "purple")
plt.plot(n, np.array(pca_normal)[:,2], marker = ".", linestyle ="-", label = "pca on original data test", linewidth = size2, color = "purple")
# Random selection from z-score
plt.plot(n, np.array(random_track_Z)[:,1], marker = "*", linestyle ="--", label = "random Z train", linewidth = size0, color = "green")
plt.plot(n, np.array(random_track_Z)[:,2], marker = "*", linestyle ="-", label = "random Z test", linewidth = size1, color = "green")
# Random selection from scaled z-score
plt.plot(n, np.array(random_track_Z_scaled)[:,1], marker = "s", linestyle ="--", label = "random scaled Z train", linewidth = size0, color = "orange")
plt.plot(n, np.array(random_track_Z_scaled)[:,2], marker = "s", linestyle ="-", label = "random scaled Z test", linewidth = size1, color = "orange")
# Random selection from regular
plt.plot(n, np.array(random_track)[:,1], marker = "o", linestyle ="--", label = "random train", linewidth = size0, color = "blue")
plt.plot(n, np.array(random_track)[:,2], marker = "o", linestyle ="-", label = "random test", linewidth = size1, color = "blue")
# Random selection from standard scalar
plt.plot(n, np.array(random_scaled)[:,1], marker = "^", linestyle ="--", label = "random scaled train", linewidth = size0, color = "black")
plt.plot(n, np.array(random_scaled)[:,2], marker = "^", linestyle ="-", label = "random scaled test", linewidth = size1, color = "black")
plt.legend()
plt.axhline(y = 0.90, color = 'r', linestyle = 'dashed', linewidth = "0.5")
# plt.axvline(x = 30, color = 'r', linestyle = 'dashed', linewidth = "0.5")
plt.xlabel("n components")
plt.ylabel("Classification accuracy")
# plt.xticks(n);
plt.title(f"Random vs PCA on original and transformed columns with NN shape {key}")
From the plots we can see that PCA on the z-transformed data and PCA on the original data give similar performance, both much better than the random-selection approaches. Hence, we can skip the z-transformation and simply apply PCA to the original data.
Another observation is that the number of hidden layers does not influence the performance; the last plot, with no hidden layer at all, proves that. So, the best option is the simplest model: logistic regression.
X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(rand_int = 10, scale = True)
clf = LogisticRegression(max_iter = 2000).fit(X_train_scaled, y_train)
clf.score(X_test_scaled, y_test)
0.9556786703601108
# Dictionary of column and its coefficient
coef_dict = dict(zip(df.columns[:-1],clf.coef_[0]))
# Ordering the dictionary by absolute coefficient value
coef_dict = {k: v for k, v in sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)}
n_col = 10
new_cols_names = list(coef_dict.keys())[:n_col]
new_cols_names
['c211', 'c99', 'c50', 'c37', 'c83', 'c212', 'c22', 'c46', 'c144', 'c121']
title = "Coefficient weights"
x = list(coef_dict.keys())[:n_col]
y = list(coef_dict.values())[:n_col]
plt.scatter(x,y)
for i, label in enumerate(x):
plt.annotate(str(label), (x[i],y[i]) , rotation=60,size=8)
plt.xticks(rotation = 45) # Rotates X-Axis Ticks by 45-degrees
plt.xticks("")
plt.grid()
plt.title(title, {'fontsize': 15},pad=30 )
plt.show()
# Dictionary of column and its coefficient
coef_dict = dict(zip(range(0,X_train_scaled.shape[1]),clf.coef_[0]))
# Ordering the dictionary by absolute coefficient value
coef_dict = {k: v for k, v in sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)}
Comparisons for different column counts
n_cols = [50,45,40,35,30,25,20,15,10,5]
pca_scores = []
lr_coe_scores = []
for n_col in n_cols:
# Slice the columns
new_cols_index = list(coef_dict.keys())[:n_col]
# PCA with normal data
pca = PCA(n_col)
pca_score = []
lr_score = []
# We store average score from 10 random train test split
for i in np.random.randint(10000,size = 10):
X_train_scaled, X_test_scaled, y_train, y_test = get_train_test_split(rand_int = i, scale = True)
# Fit the train data
pca.fit(X_train_scaled)
# Transform the train data
x_pca = pca.transform(X_train_scaled)
# Logistic regression model
lr = LogisticRegression()
# Fit the model
lr.fit(x_pca, y_train)
# Train score
train_score = lr.score(x_pca, y_train)
# Test score
x_test_pca = pca.transform(X_test_scaled)
test_score = lr.score(x_test_pca,y_test)
pca_score.append([train_score, test_score])
# Logistic regression using the columns selected by coefficient magnitude
clf = LogisticRegression().fit(X_train_scaled[:,new_cols_index], y_train)
train_score = clf.score(X_train_scaled[:,new_cols_index], y_train)
test_score = clf.score(X_test_scaled[:,new_cols_index], y_test)
lr_score.append([train_score, test_score])
# Store the average values
train_score, test_score = np.array(pca_score).mean(axis = 0)
pca_scores.append([n_col,train_score, test_score])
train_score, test_score = np.array(lr_score).mean(axis = 0)
lr_coe_scores.append([n_col,train_score, test_score])
n = np.array(pca_scores)[:,0]
plt.plot(n, np.array(pca_scores)[:,1], marker = ".", linestyle ="--", label = "PCA LR train", linewidth = "0.5", color = "purple")
plt.plot(n, np.array(pca_scores)[:,2], marker = ".", linestyle ="-", label = "PCA LR test", linewidth = "1", color = "purple")
plt.plot(n, np.array(lr_coe_scores)[:,1], marker = "o", linestyle ="--", label = "LR coeffiecient selected train", linewidth = "0.5", color = "blue")
plt.plot(n, np.array(lr_coe_scores)[:,2], marker = "o", linestyle ="-", label = "LR coeffiecient selected test", linewidth = "1", color = "blue")
plt.axhline(y = 0.9, color = 'black', linestyle = 'dashed', linewidth = "0.5")
plt.legend()
So, on average, we can get 90% classification accuracy with just 40 columns.
print(list(coef_dict.keys())[:40])
[211, 99, 50, 37, 83, 212, 22, 46, 144, 121, 122, 206, 215, 169, 123, 138, 58, 172, 0, 132, 103, 240, 194, 24, 4, 192, 202, 35, 207, 146, 125, 231, 55, 143, 184, 49, 128, 181, 8, 23]
These are the column indices in the dataframe!
# We do cross-validation on whole data set
X = np.concatenate((normal_data, anomaly_data), axis=0)
y = [0] * normal_data.shape[0] + [1] * anomaly_data.shape[0]
# Scaling
scaler = StandardScaler()
scaler.fit(X)
# Transform
X = scaler.transform(X)
# Select the columns
new_cols_index = list(coef_dict.keys())[:40]
model = LogisticRegression()
Accuracy
# For anomaly data
is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="accuracy").mean()
0.9069275629220381
# For normal data
is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="accuracy").mean()
0.9069275629220381
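The two accuracy values are identical by construction: swapping which class counts as positive flips both the labels and the predictions, which leaves every correct/incorrect decision unchanged. Recall, by contrast, depends on the choice of positive class. A tiny illustration (made-up labels):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0])
y_pred = np.array([0, 0, 0, 0])

acc = lambda t, p: np.mean(t == p)
print(acc(y_true, y_pred) == acc(1 - y_true, 1 - y_pred))  # True: accuracy is symmetric

rec = lambda t, p: np.sum((t == 1) & (p == 1)) / np.sum(t == 1)
print(rec(y_true, y_pred), rec(1 - y_true, 1 - y_pred))    # 0.0 1.0: recall is not
```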
Recall
# For anomaly data
is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="recall").mean()
0.7545454545454546
# For normal data
is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="recall").mean()
0.9560057964791756
precision
# For anomaly data
is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="precision").mean()
0.8519167426883248
# For normal data
is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="precision").mean()
0.9274925113227367
F1 score
# For anomaly data
is_anomaly = (np.array(y) == 1)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="f1").mean()
0.7830255596091451
# For normal data
is_anomaly = (np.array(y) == 0)
cross_val_score(model, X, is_anomaly,
cv=10, scoring="f1").mean()
0.940144619716649
There are a few reasons we can think of for this discrepancy, mainly in recall. First, anomalies can occur in many different ways, compared to the single form of normality. On top of that, the data is imbalanced, with fewer anomaly images. Together these lead to a high number of false negatives for the anomaly class.
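One common mitigation for an imbalance-driven recall gap (not used in this notebook, so this is purely an illustrative assumption) is `class_weight='balanced'` in `LogisticRegression`, which reweights the loss by inverse class frequency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Imbalanced toy data: many "normal" points, few "anomaly" points.
X = np.vstack([rng.normal(0, 1, (300, 5)), rng.normal(1.0, 1, (30, 5))])
y = np.array([0] * 300 + [1] * 30)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

recall = lambda m: np.sum((m.predict(X) == 1) & (y == 1)) / np.sum(y == 1)
# Balanced weighting typically raises anomaly recall (at some cost to precision)
print(recall(plain), recall(balanced))
```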
We compared the performance of different feature-reduction approaches, including correlation, PCA, random selection, and regression coefficients. We also applied feature transformations such as scaling and the z-score transformation, and tried a simple classifier based on outlier detection with a Gaussian distribution.
After these analyses, we conclude that even for a complex raw-image dataset, if we use a pretrained layer for feature extraction, a simple logistic regression binary classifier is enough to separate normal and anomaly images. There is no need for a complex CNN or a deep neural network trained from scratch.
At the same time, we can reduce the feature count to as few as 35 of the 256 columns and still get an average test score of 90%!
This approach is also more explainable, since we can inspect each selected channel of the pretrained layer to see which particular kernels activate, giving a better understanding of the anomalies.
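The final recipe can be sketched end to end (synthetic stand-in data; keeping 40 columns follows the notebook, everything else here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (150, 256)), rng.normal(1, 1, (150, 256))])
y = np.array([0] * 150 + [1] * 150)

# 1. Scale, 2. rank columns by |coefficient| of a first logistic fit,
# 3. refit on the best 40 columns only.
X_scaled = StandardScaler().fit_transform(X)
first_fit = LogisticRegression(max_iter=2000).fit(X_scaled, y)
selected = np.argsort(-np.abs(first_fit.coef_[0]))[:40]

final = LogisticRegression(max_iter=2000).fit(X_scaled[:, selected], y)
print(final.score(X_scaled[:, selected], y))  # high accuracy from 40 of 256 columns
```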